The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.
To run R and RStudio on Binder, click on this badge - .
Additional instructions (and screenshots) on using R, RStudio and Binder are available in Workshop1 Tutorial on BB.
For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computer) to try and avoid having to download the entire tidyverse. Please install the packages if you’re getting error when running the library() command. In addition we will use the plotly package to create interactive plots.
To install these packages, we use the install.packages('package') command, please note that the package name need to be quoted and that we only need to perform it once (if we followed step 3b above, or every time we restart the server if we didn’t), or when we want or need to update the package. Once the package was installed, we can load its functions using the library(package) command. Note that in this case we use the package name without quotes!.
# install required packages - needed only once! (comment with a # after first use)
# install.packages(c("dplyr", "ggplot2", "readr","RColorBrewer", "Rmisc", "plotly"))
install.packages("tibble")
# load required packages
library(dplyr)
library(readr)
library(ggplot2)
library(RColorBrewer)More information on R packages can be found in this tutorial.
Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).
We need to provide a variable name that will store the data and remember that R holds its variables in the computers RAM (which will be the limiting factor in terms of size of data that can be handled).
We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file hosted on the web into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R (which will slightly change the structure of the resulting data frame).
We can explore the data by clicking on the table icon next to the variable name in RStudio “Environment” tab (top right pane), but that’s not a good practice because it won’t work with large data sets.
We can use built-in functions for a brief exploration, such as head() to show the first 10 rows of the data and str() or glimpse() for the type of data in each column:
## # A tibble: 6 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
## spec_tbl_df [1,704 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : num [1:1704] 1952 1957 1962 1967 1972 ...
## $ pop : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. pop = col_double(),
## .. continent = col_character(),
## .. lifeExp = col_double(),
## .. gdpPercap = col_double()
## .. )
## Rows: 1,704
## Columns: 6
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
## $ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi~
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~
We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.
## country year pop continent
## Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
## Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
## Mode :character Median :1980 Median :7.024e+06 Mode :character
## Mean :1980 Mean :2.960e+07
## 3rd Qu.:1993 3rd Qu.:1.959e+07
## Max. :2007 Max. :1.319e+09
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
If we want to see how it can be used to summarise nominal (categorical) data, we can apply it on the same data, but that had been imported using the read.csv() function, that automatically converts text columns into factors, which are vectors of strings/characters with predefined categories (note that this is no longer the default behaviour in R versions >4.0).
# apply summary on data imported with read.csv(), what are the differences?
summary(read.csv("https://tinyurl.com/gapdata", stringsAsFactors = TRUE))## country year pop continent
## Afghanistan: 12 Min. :1952 Min. :6.001e+04 Africa :624
## Albania : 12 1st Qu.:1966 1st Qu.:2.794e+06 Americas:300
## Algeria : 12 Median :1980 Median :7.024e+06 Asia :396
## Angola : 12 Mean :1980 Mean :2.960e+07 Europe :360
## Argentina : 12 3rd Qu.:1993 3rd Qu.:1.959e+07 Oceania : 24
## Australia : 12 Max. :2007 Max. :1.319e+09
## (Other) :1632
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
##
We can see that when a variable is defined as a factor (contains categories), summary() will automatically summarise and count how many times wach category appears in the data.
We can also sort the data to look at extreme values. As opposed to using summary(), max() or min(), this way we see all the values in the other columns, so we can see not only the minimal recorded lifeExp, but also in which country and at which year it was observed.
## # A tibble: 1,704 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Rwanda 1992 7290203 Africa 23.6 737.
## 2 Afghanistan 1952 8425333 Asia 28.8 779.
## 3 Gambia 1952 284320 Africa 30 485.
## 4 Angola 1952 4232095 Africa 30.0 3521.
## 5 Sierra Leone 1952 2143249 Africa 30.3 880.
## 6 Afghanistan 1957 9240934 Asia 30.3 821.
## 7 Cambodia 1977 6978607 Asia 31.2 525.
## 8 Mozambique 1952 6446316 Africa 31.3 469.
## 9 Sierra Leone 1957 2295678 Africa 31.6 1004.
## 10 Burkina Faso 1952 4469979 Africa 32.0 543.
## # ... with 1,694 more rows
## # A tibble: 1,704 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Kuwait 1957 212846 Asia 58.0 113523.
## 2 Kuwait 1972 841934 Asia 67.7 109348.
## 3 Kuwait 1952 160000 Asia 55.6 108382.
## 4 Kuwait 1962 358266 Asia 60.5 95458.
## 5 Kuwait 1967 575003 Asia 64.6 80895.
## 6 Kuwait 1977 1140357 Asia 69.3 59265.
## 7 Norway 2007 4627926 Europe 80.2 49357.
## 8 Kuwait 2007 2505559 Asia 77.6 47307.
## 9 Singapore 2007 4553009 Asia 80.0 47143.
## 10 Norway 2002 4535591 Europe 79.0 44684.
## # ... with 1,694 more rows
we can use several graph types to look at our data and assess its distribution, outliers, categories and more. A good reference to visualisation types can be found at From Data to Viz website.
What type of variables do we have in each column? Knowing that can help us determine how we can visualise each one.
Do find out, we can look again at the results of summary() to refresh our memory (or scroll up).
We have two character categorical variables, country and continent and 4 numerical variables (year, pop, lifeExp and gdpPercap). Are all the numerical variables similar? Do we treat year as a continuous numerical variable or as a nominal one? We need to decide based on our analysis.
A histogram counts how many observations we have from each value and lets us assess how the data is distributed. When applied to categorical data, it shows us a graphic visualisation of the same output as running gapdata %>% dplyr::count(continent) (I’m using the double colon :: to tell R to use the count() function from dplyr package).
## # A tibble: 5 x 2
## continent n
## <chr> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
ggplot(gapdata, aes(x=continent, fill=continent)) +
geom_histogram(stat = "count", width=0.8) +
scale_fill_brewer(palette = "Set1") + theme_bw(16)## Warning: Ignoring unknown parameters: binwidth, bins, pad
Figure 3: Histogram of continent data.
The real power of a histogram though comes when we apply it to a continuous numerical variable. However, when the data is truly continuous, the range need to be divided to “bins” into which each observation falls and then the sum of observation in each bin is presented.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 4: Histogram of life expectancy data.
We can increase the number of bins for higher resolution, or decrease it for a more coarse look. We can also colour each continent to get a better understanding of the distribution of values from each continent.
ggplot(gapdata, aes(x=lifeExp, fill=continent)) +
geom_histogram(bins = 100) +
scale_fill_brewer(palette = "Set2") +
theme_bw(16)Figure 5: Histogram of life expectancy data by continent.
Can you try and plot a histogram of the years, what type of data is it based on the output?
Another variant of the histogram is a density plot, which is a “smoothed” version of the histogram. If we apply some transparency to the fill colour (using the alpha=0.5 argument), we can see where the data from the continents overlap (see Figure 6) .
ggplot(gapdata, aes(x=lifeExp, fill=continent)) +
geom_density(alpha=0.5) +
scale_fill_brewer(palette = "Dark2") +
theme_bw(16)Figure 6: Density plot of life expectancy data by continent.
A more detailed method to look at distribution of continuous numerical data is using a boxplot. These are visual representations of the descriptive statistics provided by summary() command. Each boxplot visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually. The hinges (top and bottom of box) represent the range between the 25%-75% quantiles, while the length whiskers represent 1.5*IQR (interquantile range) from the hinges. Data points that fall outside this range are considered “outliers” and are plotted individually.
ggplot(gapdata, aes(x=continent, y=lifeExp, colour=continent)) +
geom_boxplot(width=0.25) +
scale_colour_brewer(palette = "Set1") +
theme_bw(16)Figure 7: Box plots of life expectancy data by continent.
A box plot representation gives us a better tool to compare groups.
We saw that the boxplot let us get a better understanding of how the data is distributed around the median, but it doesn’t give us the notion of the density of data observations that were collected along the range (like the histogram did).
A violin plot brings the best in both worlds and it is basically a vertical, 2-sided density plots that are plotted along the Y-axis (see Figure 8).
ggplot(gapdata, aes(x=continent, y=lifeExp, fill=continent)) +
geom_violin(draw_quantiles = c(0.2,0.8), trim = FALSE) +
scale_fill_brewer(palette = "Dark2") +
theme_bw(16)Figure 8: Violin plots of life expectancy data by continent.
Note that in R we can easily determine which quantiles we wish to plot. If we specify trim=TRUE, the violin plots will be trimmed and will not include the outliers.
We can then go ahead and combine violin with boxplots, along with a few adjustments, such as calculating and plotting the average value, to make the plot “publication” ready (Figure 9).
# calculate mean
mean_values <- gapdata %>% group_by(continent) %>% dplyr::summarise(lifeExp=mean(lifeExp))
ggplot(gapdata, aes(x=continent, y=lifeExp, colour=continent)) +
geom_violin(aes(fill=continent), alpha=0.6) +
geom_boxplot(width=0.15) +
geom_point(data = mean_values, size=2.5, pch=18) +
scale_fill_brewer(palette = "Dark2") +
scale_colour_brewer(palette = "Dark2") +
labs(y="Life Expectancy", fill="Continent", colour="Continent", x="") +
theme_bw(16)Figure 9: Combined violin and box plots of life expectancy data by continent.
Another option to show the distribution of data points, along with some descriptive statistics, is by using “Raincloud Plots”, which can convey a lot of useful information. See details in Visualizing Distributions with Raincloud Plots using ggplot2 (blog post)
📢 New blog post!
— Cédric Scherer (@CedScherer) June 6, 2021
I finally finished the post on "Visualizing Distributions with Raincloud Plots with #ggplot2" showing why such hybrid charts are a good alternative to boxplots and numerous ways how to create them in #rstats with #ggplot2. #dataviz
🔗 https://t.co/ImrmFteeig pic.twitter.com/RGk15s20e6
Sometimes it is very useful to be able to interact with the plot, to interrogate outliers and get the values in real time. We can use the plotly package for that on the boxplot we created earlier.
plot <- ggplot(gapdata, aes(x=continent, y=lifeExp, colour=continent)) +
geom_boxplot(width=0.25) +
scale_colour_brewer(palette = "Set1") +
labs(y="Life Expectancy", colour="Continent", x="") +
theme_bw(16)
plotly::ggplotly(plot)Figure 10: Interactive box plots of life expectancy data by continent (powered by Plotly).
Use your mouse to hover over the graph in Figure 10 to get details of outliers and summary statistics for each continent.
This is a great way to share plots between students/supervisors/collaborators to engage discussion on data interpretation and analysis and a tool to allow the public to interact with plots that are hosted online (ask me how!).
After we’ve finished working on Binder we can download the R script that we wrote and any output files (summary tables and figures). We can access those files by using the files tab in RStudio (bottom right pane).
Select the files/folders that you would like to download and click on More Export… (see screenshot in Figure 1 below) to save the file on your computer.
Figure 1: Download files from Binder/RStudio screenshot.
ggplot2 plots (link)Stuck with a boring boxplot? Check out this brilliant "evolution of ggplot" tutorial by @cedscherer
— We are R-Ladies (@WeAreRLadies) July 8, 2020
https://t.co/LNLf1iWGaI
4/n pic.twitter.com/GDd5TPJ1mx
Please contact me at i.bar@griffith.edu.au for any questions or comments.